Thinking About Data

Author: Dave Clark

Affiliation: Binghamton University

Published: January 21, 2025

Get to know your data, explore etc

These slides are intended to get you thinking about the data you are using, what it tells us and what it doesn’t, and to think carefully about the difference between what we’re trying to measure and what we’re actually measuring.

code
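# Anscombe's quartet
library(dplyr)    # %>%, filter()
library(tidyr)    # pivot_longer()
library(ggplot2)  # plots
library(knitr)    # kable()

# reshape to long form: one row per observation, with columns set, x, y
longanscombe <- anscombe %>%
  pivot_longer(cols = everything(),            # pivot all the columns
               cols_vary = "slowest",          # keep the datasets together
               names_to = c(".value", "set"),  # new var names; .value = stem of vars
               names_pattern = "(.)(.)")       # split e.g. "x1" into stem "x" and set "1"

kable(
  list(anscombe),
  caption = "Anscombe's Quartet",
  booktabs = TRUE,
  valign = 't',
  row.names = FALSE
)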
Anscombe’s Quartet
x1  x2  x3  x4     y1     y2     y3     y4
10  10  10   8   8.04   9.14   7.46   6.58
 8   8   8   8   6.95   8.14   6.77   5.76
13  13  13   8   7.58   8.74  12.74   7.71
 9   9   9   8   8.81   8.77   7.11   8.84
11  11  11   8   8.33   9.26   7.81   8.47
14  14  14   8   9.96   8.10   8.84   7.04
 6   6   6   8   7.24   6.13   6.08   5.25
 4   4   4  19   4.26   3.10   5.39  12.50
12  12  12   8  10.84   9.13   8.15   5.56
 7   7   7   8   4.82   7.26   6.42   7.91
 5   5   5   8   5.68   4.74   5.73   6.89
code
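# fit y ~ x separately for each of the four sets and store the coefficients
B <- data.frame(set = numeric(0), b0 = numeric(0), b1 = numeric(0))
for (i in unique(longanscombe$set)) {  # loop over the 4 sets, not all 44 rows
    m <- lm(y ~ x, data = longanscombe %>% filter(set == i))
    # append one row: set number, intercept, slope
    B[nrow(B) + 1, ] <- list(as.numeric(i), coef(m)[[1]], coef(m)[[2]])
}


kable(
  list(B),
  caption = "Anscombe's Quartet",
  booktabs = TRUE,
  valign = 't',
  row.names = FALSE
)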
Anscombe’s Quartet
set        b0         b1
  1  3.000091  0.5000909
  2  3.000909  0.5000000
  3  3.002454  0.4997273
  4  3.001727  0.4999091
code
ggplot(longanscombe, aes(x = x, y = y)) +
  geom_point() + 
  facet_wrap(~set) +
  geom_smooth(method = "lm", se = FALSE, color="red")
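
The regression lines are nearly identical, and so are the usual summary statistics. As a rough sketch (using only the `anscombe` data frame that ships with base R; the helper name `describe_set` is made up for illustration), the means, variances, and correlations can be tabulated per set:

```r
# Summary statistics for each Anscombe set.
# describe_set() is a hypothetical helper, not part of any package.
describe_set <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i,
             mean_x = mean(x), var_x = var(x),
             mean_y = mean(y), var_y = var(y),
             cor_xy = cor(x, y))
}

# stack the four one-row summaries into a single table
summary_stats <- do.call(rbind, lapply(1:4, describe_set))
print(round(summary_stats, 2))
```

The four rows agree to about two decimal places even though the scatterplots look nothing alike, which is the reason to plot the data before modeling it.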

Data Generating Process

  • What produced the data we observe?
    • political process, actors, etc.
    • are those actors purposeful wrt the observed data? That is, do the actors have
    • data collector; choices, biases, mistakes.

Data Generating Process

  • Why do we observe the data we see and not the data we don’t?
    • existence is not randomly determined.
    • research questions are usually about things that happen, not things that do not or cannot.
    • reporting itself is a political process.
    • reporting is shaped by resources.

The data generating process is the complete description of how the observed data arose and how other such data would arise. It includes variables, conditionals, functional forms, mappings from one unit to another, etc.
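
One way to make this concrete is to simulate a DGP and then apply a reporting rule to it. The sketch below is purely illustrative: the linear form, the coefficients, the seed, and the rule "a unit is recorded only if y > 1" are all assumptions chosen for the example, not anything estimated from real data.

```r
set.seed(501)  # arbitrary seed so the sketch is reproducible

# Hypothetical DGP: y = 1 + 0.5 * x + noise
n <- 5000
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
full <- data.frame(x = x, y = y)

# Hypothetical reporting rule: a unit enters the data only if y > 1
observed <- full[full$y > 1, ]

# slope of y on x in the full data versus the data we get to see
b_full     <- coef(lm(y ~ x, data = full))[["x"]]
b_observed <- coef(lm(y ~ x, data = observed))[["x"]]

round(c(full = b_full, observed = b_observed), 3)
```

Because selection here depends on the outcome, the slope estimated from the observed units is pulled toward zero relative to the slope in the full data. The model is the same in both cases; only the process that decides what gets recorded differs.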

Thinking about the DGP

  • What are the units of observation? Who is taking action or having action done to them? Are the units heterogeneous, and if so, how?

  • What circumstances are the units in?

  • What are the units’ choice sets?

  • What relates the units’ circumstances to the outcomes?

  • How are the units related to each other?

  • What don’t we observe?

Can the DGP exist?

  • can the units experience the causal claim in question?

A Terrible Map


  • does the data represent the entire DGP or just part of it?

War outcomes

Outcome  Autocrat  Democrat  Total
Loser          42         9     51
Winner         32        38     70
Total          74        47    121

From Lake (1992), p. 31
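
As a small sketch, the table can be entered by hand and summarized in base R. The counts are copied from the table above; the object names `war` and `win_share` are just illustrative. Note that the table contains only wars that were actually fought, so it describes one part of the DGP.

```r
# Counts copied from the table above (Lake 1992, p. 31)
war <- matrix(c(42, 32,   # Autocrat: loser, winner
                 9, 38),  # Democrat: loser, winner
              nrow = 2,
              dimnames = list(Outcome = c("Loser", "Winner"),
                              Regime  = c("Autocrat", "Democrat")))

addmargins(war)  # reproduces the row and column totals

# share of wars won, within each regime type
win_share <- prop.table(war, margin = 2)["Winner", ]
round(win_share, 3)

# test of independence between regime type and outcome
chisq.test(war)
```

The shares are conditional on a war having occurred; they say nothing about the disputes that never escalated to war, which are not in the table.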


Observability

  • are we observing the entire phenomenon?


Asking the right question

  • are we asking the right question? Are we making Type III errors, that is, finding the right answer to the wrong question? Kennedy (2002, p. 572) writes:

A type III error, introduced in Kimball (1957), occurs when a researcher produces the right answer to the wrong question. A corollary of this rule, as noted by Chatfield (1995, p. 9), is that an approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.


Data

Collections of alike units, their characteristics, features, choice sets, behaviors, etc.

  • what are the units? In what ways are they heterogeneous?
  • what units are included? Which ones are missing? Why?
  • what do the variables measure?
  • how are the variables measured?
  • what observations are missing? Why?
  • what is the sample; what is the population (sampling frame) from which the sample is drawn?
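
Several of these questions can be given a first pass in a few lines. The sketch below uses `airquality`, a data frame that ships with base R, purely as a stand-in for whatever data you are working with; the choice of `Temp` and `Ozone` for the comparison is arbitrary.

```r
# `airquality` ships with base R; it stands in here for your own data frame
dat <- airquality

dim(dat)                          # how many units and variables?
na_counts <- colSums(is.na(dat))  # missing values per variable
na_counts

complete <- complete.cases(dat)   # units observed on every variable
mean(complete)                    # share of complete units

# Do units with missing Ozone differ on an observed variable?
# FALSE = Ozone observed, TRUE = Ozone missing
tapply(dat$Temp, is.na(dat$Ozone), mean)
```

Counting missing values answers "what is missing"; comparing observed characteristics across missing and non-missing units is a crude first look at "why".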

Types of variables

Variables are either

  • discrete - observations match to integers; all possible values are clearly distinguishable; not divisible. E.g., number of protests in DC this year; an individual’s sex; Polity score.

  • continuous - observations can take on any real value between boundaries (sometimes \(-\infty, +\infty\)); infinitely divisible. E.g., household income, GDP per capita.

Discrete variables

May be of two types or levels of measurement:

  • nominal - categories are distinct, but lack order. E.g., religion = Hindu, Muslim, Catholic, Protestant, Jewish. Binary variables are nominal, e.g., Sex = male (0), female (1); do you have blue eyes? yes (0), no (1).

  • ordinal - take on countable values, increasing/decreasing in some dimension. E.g., Polity -10, -9, \(\ldots\) 0, 1, \(\ldots\) 9, 10 increasing in democracy; survey responses “Do you feel safe traveling abroad?” Not at all; sometimes; yes, completely.
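
In R, the nominal/ordinal distinction maps roughly onto unordered versus ordered factors. A minimal sketch, with made-up response vectors:

```r
# Nominal: categories with no order
religion <- factor(c("Hindu", "Muslim", "Catholic", "Protestant", "Jewish"))
is.ordered(religion)

# Ordinal: the levels are ranked, so we state the ranking explicitly
safe <- factor(c("sometimes", "not at all", "yes, completely", "sometimes"),
               levels = c("not at all", "sometimes", "yes, completely"),
               ordered = TRUE)
is.ordered(safe)

safe[1] < safe[3]  # comparisons are defined for ordered factors
table(safe)        # counts are reported in level order
```

Comparing two elements of `religion` with `<` returns `NA` with a warning, which is R's way of saying that the ordering is not meaningful for a nominal variable.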

Continuous variables

Can be of two types (levels of measurement):

  • interval - 1 unit increase has same meaning across the scale (i.e., the intervals are the same); e.g., degrees Celsius or Fahrenheit.

  • ratio - interval, but with a meaningful absolute zero; e.g., weight in pounds, where zero lbs indicates the absence of weight; a Venmo balance of zero means there is actually no money; degrees Kelvin; duration of a war in days, where zero days means there’s no war.
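
A quick numerical illustration of why the zero point matters, using two arbitrary temperatures: differences carry over from one scale to another, but ratios only make sense on the scale with an absolute zero.

```r
celsius    <- c(10, 20)           # two arbitrary temperatures
fahrenheit <- celsius * 9 / 5 + 32
kelvin     <- celsius + 273.15

# Differences keep their meaning on every scale (up to the size of a degree)
diff(celsius)
diff(fahrenheit)
diff(kelvin)

# Ratios depend on where zero sits
celsius[2] / celsius[1]
fahrenheit[2] / fahrenheit[1]
kelvin[2] / kelvin[1]
```

The Celsius ratio is 2 and the Fahrenheit ratio is 1.36, yet the same two temperatures differ by only a few percent in Kelvin. "Twice as hot" is a statement only the ratio scale supports.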

Levels of measurement

These four levels of measurement can be ordered by the amount of information a variable contains:

  • nominal

  • ordinal

  • interval

  • ratio

We can convert a higher level of measurement to a lower one, although doing so sacrifices information. We cannot convert a lower level to a higher one.
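
A sketch of moving down the levels, using made-up war durations; the cut points of 30 and 365 days are arbitrary choices for the example.

```r
# Hypothetical war durations in days (ratio level)
duration <- c(0, 12, 45, 200, 380, 1500)

# Ratio -> ordinal: bin the durations into ranked categories
duration_ord <- cut(duration,
                    breaks = c(-Inf, 30, 365, Inf),
                    labels = c("short", "medium", "long"),
                    ordered_result = TRUE)

# Ordinal -> binary (nominal): long war or not
long_war <- as.integer(duration_ord == "long")

data.frame(duration, duration_ord, long_war)
```

Wars of 380 and 1500 days both end up as "long", and nothing in `duration_ord` or `long_war` lets us recover the original durations. That is the information lost on the way down.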

Levels of Measurement and Models

In general, the level of measurement of \(y\) (so the type and amount of information in a variable) shapes what type of model is appropriate.

  • discrete variables usually require statistics/models in the Binomial family (for our purposes, mostly MLE models like the Logit.)

  • continuous variables usually require statistics/models in the Normal/Gaussian family (for our purposes, mostly OLS models like the linear regression.)
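
As a rough illustration of how the outcome's level of measurement drives the choice of model, the sketch below fits one model of each kind. It uses `mtcars`, which ships with base R, only because it contains both a continuous variable (`mpg`) and a binary one (`am`); the specifications are arbitrary.

```r
# `mtcars` ships with base R and stands in for real data

# Continuous outcome (miles per gallon): linear regression via OLS
ols <- lm(mpg ~ wt, data = mtcars)

# Binary outcome (am: 0 = automatic, 1 = manual): logit via maximum likelihood
logit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

coef(ols)
coef(logit)

# Logit predictions are probabilities, so they stay strictly between 0 and 1
range(predict(logit, type = "response"))
```

The OLS coefficient on `wt` is a change in `mpg` per unit of weight, while the logit coefficient is a change in the log-odds of `am = 1`, so the two are not directly comparable.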


Describe these data

Continuous or discrete; what can you say about these data from their observed distribution?

code
gap <- read.csv("/Users/dave/documents/teaching/501/2024/slides/L1-data/data/gapminder.csv")

# histogram on the density scale with a kernel density estimate on top
# (after_stat(density) replaces the deprecated ..density.. notation)
ggplot(gap, aes(x = alcohol_consumption_per_adult_15plus_litres)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") +
  labs(x = "Liters of Booze", y = "Density", caption = "Alcohol Consumption, Gapminder") +
  ggtitle("Density - Alcohol Consumption per Adult (Liters)")

code
# normal density evaluated at each observation, using the sample mean and sd
alc <- gap$alcohol_consumption_per_adult_15plus_litres
gap$nalc <- dnorm(alc, mean = mean(alc, na.rm = TRUE), sd = sd(alc, na.rm = TRUE))

# histogram and kernel density, with the normal PDF overlaid as a dashed line
ggplot(gap, aes(x = alcohol_consumption_per_adult_15plus_litres)) +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = "white") +
  geom_density(alpha = .2, fill = "#FF6666") +
  geom_line(aes(y = nalc), linetype = "longdash", linewidth = 1) +  # linewidth replaces size in ggplot2 >= 3.4.0
  labs(x = "Liters of Booze", y = "Density", caption = "Alcohol Consumption, Gapminder") +
  ggtitle("Alcohol Consumption per Adult (Liters), Normal PDF")


Describe these data

Polity Scores

code
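# flag the two years being compared; every other year is coded 0
polity <- polity %>% mutate(era = ifelse(year == 1980, 1980, ifelse(year == 2018, 2018, 0)))

# horizontal bars: one per Polity score, labelled with the number of countries
# (colour and fill are ignored when passed to ggplot(), so they are set on the bars)
p <- ggplot(data = polity %>% filter(era != 0), aes(y = polity2)) +
  geom_bar(colour = "black", fill = "white") +
  geom_text(aes(label = after_stat(count)), stat = "count", hjust = -.2,
            colour = "black", size = 2.5, position = position_dodge(0.9))

# one panel per year
p + facet_wrap(era ~ .) +
  labs(y = "Polity score", x = "Countries", caption = "Polity project") +
  ggtitle("Polity Scores, 1980 and 2018")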

Overstaying Terms

Two variables - Polity, and the number of efforts (successful, unsuccessful) by state leaders to overstay their terms in office.

code
#polity <- read_dta("/Users/dave/Documents/2023/PITF/slides/polity5.dta")
# merge the overstay data (tl) with Polity by country code and year
overpolity <- left_join(tl, polity, by = c("ccode", "year"))

# total successful overstay attempts at each Polity score
success <- overpolity %>%
  group_by(polity2) %>%
  summarise(events = sum(tl_success)) %>%
  filter(!is.na(polity2)) %>%
  mutate(outcome = "succeeded")

# total failed overstay attempts at each Polity score
fail <- overpolity %>%
  group_by(polity2) %>%
  summarise(events = sum(tl_failed)) %>%
  filter(!is.na(polity2)) %>%
  mutate(outcome = "failed")

tlpolity <- rbind(success, fail)

# stacked bars: attempts by Polity score, split by outcome
ggplot(tlpolity, aes(fill = outcome, y = events, x = polity2)) +
  geom_col(position = "stack") +
  scale_fill_manual(values = c("dark green", "light green")) +
  labs(x = "Polity", y = "Frequency", caption = "Overstaying data (Versteeg et al. 2020)") +
  ggtitle("Overstay Attempts over Regime") +
  scale_x_continuous(breaks = seq(-10, 10, 1))

Thinking about data

  • what are the units? Why?
  • why are these variables in the data? Why not others?
  • what is purposely not in the data?
  • what is incidentally not in the data?
  • why were the data generated? Does that purpose fit your purpose?
  • are the data dynamic or static over time?

References

Kennedy, Peter E. 2002. “Sinning in the Basement: What Are the Rules? The Ten Commandments of Applied Econometrics.” Journal of Economic Surveys 16 (4): 569–89.
Lake, David A. 1992. “Powerful Pacifists: Democratic States and War.” American Political Science Review 86 (1): 24–37.